Abstract

Social networks are analyzed as graphs within discrete mathematics, an approach with a wide
range of applications in contexts such as technology, social phenomena and biological systems.
At present this theory provides a set of tools for phenomenological analyses that would be
difficult or almost impossible with a different approach. In this work we construct social
networks for different technical communities from electronic mail and "News" services in the
Spanish language. The algorithm relies on the RFC 2822 and RFC 1036 standards to assemble
message threads. The results are quite different from those obtained for other kinds of
community, such as the jazz musicians community. Nevertheless, they show an analogy to random
graphs obtained by the "Configuration Model" method. This suggests that some generalizing
assumptions are not correct.

PACS numbers: 05.45.-a 
I. INTRODUCTION

Complex social networks associated with the Internet, such as e-mail lists and news services,
do not have a structured architecture like that of a network communication engineering project.
They show a kind of synergy produced by the large number of users who form the community.
Studies of such networks face a problem: sometimes cultural guidelines are not taken into
account, and results are generalized across communities with quite different guidelines. This is
why we have restricted our analysis to one language and to technical communities, as an
experimental application of the theoretical tools of social networks. This work is organized
in two sections: the first is a brief introduction to social networks, and the second studies
communities arising from e-mail and "News Services" in the Spanish language.

A. Networks 

A social network is a set of relations (links or edges) among different elements (nodes,
vertices or actors). Formally, a network is a graph $G = (V, E, \gamma)$, where $V$ is the set
of vertices, $E$ is the set of edges, and $\gamma$ assigns to each edge $e \in E$ a pair of
vertices, $\gamma(e) = \{v, w\}$, known as the ends of the edge. Recently binary matrices have
been used for the study of networks: there exists an isomorphism $f : G \to B_n$, where $B_n$
is the set of square binary matrices of dimension $n \times n$. This matrix is called the
adjacency matrix (AM). By convention, social scientists place the actors (output, egos) in the
rows and the attributes or related actors (input, alters) in the columns. The AM is symmetric,
since we are dealing with undirected graphs. There are also multiple graphs, in which more than
one kind of edge is identified, e.g. (kinship, friendship) or (theme, author). These are quite
useful for building social substructures, although building the required AM is more
complicated; usually we consider the AM associated with one particular projection, e.g.
kinship. Graphs may also be weighted, that is, a weight may be assigned to each link, which
gives a nonzero value for the corresponding AM element. As can be seen, the AM holds all the
relevant information about a particular social network. It is worth noting that the diagonal
elements of the matrix are filled with zeros, or are neglected in the algorithms, since
self-interaction makes no sense. However, they have to be considered in Boolean products of
the matrix.

Other properties can also be associated with the AM. Among the "macroscopic" properties of the
actors are the following. Path: a concatenation of vertices connected by edges such that no
edge is repeated, $\pi = \{x_1, x_2, \ldots, x_n\}$, where $x_1$ is the initial vertex and
$x_n$ the final vertex. Geodesic: the path of minimal length between two actors. If no such
path exists, as in the case of non-connected graphs, its length is taken to be infinite. The
geodesic is the optimal path between actors because, socially, it joins the actors with the
strongest links.

There are also "microscopic" properties, such as the clique, a measure of the transitive triads
of the network. Two definitions exist. The first originates in graph theory, where it is known
as the transitivity of a graph:

$$\text{transitivity} = \frac{3 \times \#\,\text{triangles}}{\#\,\text{triplets}}$$

where $\#$ denotes the cardinality of the set. The second was formulated by Watts and Strogatz;
the clustering associated with actor $i$ is defined as

$$C_i = \frac{\#\,\text{edges in the neighborhood of } i}{\text{maximum } \#\,\text{of edges in the neighborhood of } i}$$

and then averaged over the actors, $C(G) = \langle C_i \rangle$; this is the "clustering".
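As an illustration of the two definitions above, the following minimal Python sketch builds a symmetric binary AM with a zero diagonal and evaluates the transitivity and the Watts and Strogatz clustering on a toy network. The function names, the toy edge list and the convention $C_i = 0$ for actors with fewer than two neighbors are assumptions made for this example; they are not taken from the original code.

```python
from itertools import combinations

def adjacency_matrix(n, edges):
    """Symmetric binary adjacency matrix with a zero diagonal."""
    a = [[0] * n for _ in range(n)]
    for v, w in edges:
        if v != w:                       # self-interactions are ignored
            a[v][w] = a[w][v] = 1
    return a

def local_clustering(a, i):
    """C_i: edges among the neighbors of i over the maximum possible number."""
    nbrs = [j for j, x in enumerate(a[i]) if x]
    k = len(nbrs)
    if k < 2:                            # assumed convention: C_i = 0
        return 0.0
    links = sum(a[u][v] for u, v in combinations(nbrs, 2))
    return links / (k * (k - 1) / 2)

def clustering(a):
    """C(G) = <C_i>, the average of C_i over all actors."""
    return sum(local_clustering(a, i) for i in range(len(a))) / len(a)

def transitivity(a):
    """3 x (# triangles) / (# connected triplets)."""
    n = len(a)
    triangles = sum(a[i][j] and a[j][k] and a[i][k]
                    for i, j, k in combinations(range(n), 3))
    triplets = sum(d * (d - 1) // 2 for d in map(sum, a))
    return 3 * triangles / triplets if triplets else 0.0

# Toy network: a triangle 0-1-2 plus a pendant actor 3 attached to 2.
A = adjacency_matrix(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
print(transitivity(A))   # 3 * 1 triangle / 5 triplets
print(clustering(A))     # (1 + 1 + 1/3 + 0) / 4
```

On this toy network the two measures differ (0.6 against 7/12), which shows that they are related but not interchangeable.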

In either case, the transitive triads are small groups that represent the balance, or natural
state, towards which social relations tend. Another "microscopic" property is the average
degree of the closest neighbors (CN), defined as

$$K_{nn}(k) = \sum_{k'} k' \, P(k'|k) \qquad (1)$$

where $P(k'|k)$ is the conditional probability that a vertex of degree $k$ is connected to a
vertex of degree $k'$. When $K_{nn}(k)$ is an increasing function the network is called
associative (assortative), and when it is decreasing it is called dissociative
(disassortative). A network may also be characterized by its row histogram, known as the
prestige of the actor; since the AM is symmetric, this coincides with the column histogram,
known as the popularity of the actor. At present, however, another statistic is used, the
connectivity probability $P(k)$: the probability that a randomly chosen vertex has $k$ edges.
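The two statistics just defined can be read directly off an adjacency matrix. The sketch below is only illustrative: the helper names and the star-shaped toy network are assumptions. It uses the fact that, for vertices of equal degree $k$, the sum over $k'$ of $k' P(k'|k)$ equals the mean degree of their neighbors.

```python
from collections import Counter, defaultdict

def degree_distribution(a):
    """P(k): fraction of vertices with exactly k edges."""
    n = len(a)
    counts = Counter(sum(row) for row in a)
    return {k: c / n for k, c in sorted(counts.items())}

def knn(a):
    """K_nn(k): mean degree of the neighbors of vertices of degree k."""
    deg = [sum(row) for row in a]
    acc = defaultdict(list)
    for i, row in enumerate(a):
        if deg[i]:                       # isolated vertices have no neighbors
            mean_nbr = sum(deg[j] for j, x in enumerate(row) if x) / deg[i]
            acc[deg[i]].append(mean_nbr)
    return {k: sum(v) / len(v) for k, v in sorted(acc.items())}

# Toy star: actor 0 is linked to actors 1, 2 and 3.
STAR = [[0, 1, 1, 1],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0]]
print(degree_distribution(STAR))   # {1: 0.75, 3: 0.25}
print(knn(STAR))                   # {1: 3.0, 3: 1.0}
```

Here $K_{nn}(k)$ decreases with $k$, so the star is an extreme case of a dissociative network.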

According to the functional shape of the tail of the histogram ($k \to \infty$), networks can
be classified as exponential, when $P(k) \sim e^{-\lambda k}$; scale-free, when
$P(k) \sim k^{-2-\gamma}$ with $\gamma > 0$; broad-scale, when the network is "scale-free" but
with an abrupt cutoff; and single-scale, when the tail has a fast asymptotic decay. In most
field experiments scale-free networks were found to be dominant. In order to estimate the
exponent, the cumulative probability $P(k' > k)$ is used, defined as

$$F(k) = \sum_{k' > k} P(k') \qquad (2)$$

If $P(k) \sim k^{-\gamma}$ in the tail, then
$F(k) \sim \int_k^{\infty} P(k') \, dk' \sim k^{-\gamma+1}$.
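A minimal sketch of how the cumulative probability could be computed from a list of vertex degrees is given below. It assumes the strict convention $P(k' > k)$ used in the text, so that $F(0) = 1$ whenever every actor has at least one link; the function name and the toy degree list are hypothetical.

```python
from collections import Counter

def cumulative_distribution(degrees):
    """F(k): fraction of vertices whose degree exceeds k, for k = 0 .. max-1."""
    n = len(degrees)
    counts = Counter(degrees)
    f, tail = {}, 0
    for k in range(max(degrees) - 1, -1, -1):
        tail += counts.get(k + 1, 0)     # add the vertices of degree k + 1
        f[k] = tail / n
    return dict(sorted(f.items()))

# Degrees of a toy star network: one hub and three leaves.
F = cumulative_distribution([3, 1, 1, 1])
print(F)   # {0: 1.0, 1: 0.25, 2: 0.25}
```

Because $F(k)$ is a running sum, it is much smoother than $P(k)$ in the sparsely populated tail, which is why it is preferred for estimating the exponent.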

II. APPLICATION TO E-MAIL SPANISH LISTS 

All the unweighted properties of the relations among actors are contained in the AM. Each row
index is associated with the actor who starts a subject (root author), and each column index
with the actors involved in the resulting conversation thread (descendant authors). The
indices are arranged not lexicographically but by prestige: a low index is assigned to the
actor with the greatest absolute frequency, and so on up to the highest index, which
corresponds to the author with the least prestige. For the construction of the matrix we have
taken into account neither self-answers nor unthreaded messages (those without a thread);
they are discarded in phases prior to the application of the algorithm.

A. Algorithm used for the message threading 

According to the RFC 822 standard and its derivatives RFC 2822 and RFC 1036, the transmission
format of messages from electronic mail and News services consists of a set of header fields
and a body in plain text. Of all the fields sent in a single message, the following are
required to construct message threads:

From: This field contains the identity and address of the person who sends the message.

Subject: This field indicates the nature of the message.

Message-ID: This field must be unique for each message; the suggested format is
"<local_part@domain>".

In-Reply-To: This field refers to the Message-ID of the message being answered; if the
message is new, this field does not appear.
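The extraction of these four fields can be sketched with Python's standard `email` package. This is not the filter used in this work, only an illustration; the function name, the sample message and the choice to keep just the lower-cased address of the sender are assumptions.

```python
from email import message_from_string
from email.utils import parseaddr

FIELDS = ("From", "Subject", "Message-ID", "In-Reply-To")

def extract_fields(raw):
    """Return the four threading fields; absent headers map to None."""
    msg = message_from_string(raw)
    out = {name: msg.get(name) for name in FIELDS}
    if out["From"] is not None:
        # Keep only the address part, so one actor has one identity.
        out["From"] = parseaddr(out["From"])[1].lower()
    return out

# Hypothetical reply message (addresses and IDs are made up).
RAW = (
    "From: Ana <Ana@example.org>\n"
    "Subject: Re: kernel panic\n"
    "Message-ID: <2@example.org>\n"
    "In-Reply-To: <1@example.org>\n"
    "\n"
    "Body text.\n"
)
fields = extract_fields(RAW)
print(fields["From"], fields["In-Reply-To"])
```

A message that starts a thread has no In-Reply-To header, so the corresponding entry is `None`.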

Before the threading algorithm is applied, the messages go through a filter which extracts
the fields of interest and deletes unwanted messages. This filter is written in Perl. Since
most of the codes for message threading are not freely available, it was necessary to write
our own code in C, starting from an existing GPL code in Java. It was modified to generate,
for each thread, a list of the sequence of actors involved, with the actor who started the
thread as the root of the list. This algorithm has proved its robustness in on the order of
a hundred thousand trials.

A brief sketch of the algorithm follows. It is based on the handling of linked data
structures, namely:

    struct Container {
        struct Message   *message;  /* may be NIL (NULL) */
        struct Container *parent;
        struct Container *child;    /* first child */
        struct Container *next;     /* next sibling */
    };

The field "message" may be NIL. The structure "Message" has the following fields:

    struct Message {
        char *Subject;
        char *Message_ID;
        char *In_Reply_To;          /* In-Reply-To header; NIL for a new message */
        char *From;
    };

When the field "In-Reply-To" takes the value NIL, the message is a parent (root) message. An
indexed table is built whose index is the "Message_ID" of each parent message, and a root
(parent) "Container" is associated with each entry. Then, using the threading algorithm, the
database of descendant messages associated with each root is built. A table of the "Author"
of each parent message is generated, ordered by decreasing absolute frequency of appearance.
Finally an AM is built from the previous results. For details of the algorithm see "Message
Threading" by Jamie Zawinski (technical report).
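The steps above can be sketched in a few lines of Python. This is a deliberately simplified stand-in for the threading algorithm, not the C implementation used in this work: it follows In-Reply-To links only, and it assumes that prestige is the number of answered threads an author started, with ties broken alphabetically. The function name and the sample messages are hypothetical.

```python
from collections import Counter

def build_am(messages):
    """messages: iterable of (message_id, author, in_reply_to or None)."""
    messages = list(messages)
    author = {mid: who for mid, who, _ in messages}
    parent = {mid: rep for mid, _, rep in messages}

    def root(mid):
        seen = set()                       # guards against reply cycles
        while parent[mid] in author and mid not in seen:
            seen.add(mid)
            mid = parent[mid]
        return mid

    links = []                             # (root message id, answering author)
    for mid, who, _ in messages:
        r = root(mid)
        if r != mid and author[r] != who:  # skip roots and self-answers
            links.append((r, who))

    # Assumed prestige: number of answered threads each author started.
    prestige = Counter(author[r] for r in {r for r, _ in links})
    actors = {author[r] for r, _ in links} | {who for _, who in links}
    order = sorted(actors, key=lambda a: (-prestige[a], a))
    index = {a: i for i, a in enumerate(order)}

    am = [[0] * len(order) for _ in order]
    for r, who in links:
        i, j = index[author[r]], index[who]
        am[i][j] = am[j][i] = 1            # undirected, hence symmetric
    return order, am

MESSAGES = [
    ("<1>", "ana", None),
    ("<2>", "luis", "<1>"),
    ("<3>", "ana", "<2>"),    # self-answer inside ana's own thread: dropped
    ("<4>", "eva", "<1>"),
    ("<5>", "luis", None),    # never answered: contributes no link
    ("<6>", "eva", None),
    ("<7>", "ana", "<6>"),
]
order, am = build_am(MESSAGES)
print(order)
print(am)
```

Messages that nobody answers never produce a link, so unthreaded messages drop out without a separate filtering pass.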
B. Analysis of the obtained Adjacency Matrices

We have taken as leading cases two mailing lists in Spanish which are representative of the
observations made in previous studies. One is a purely technical list, called LU, and the
other has the same actors but a more general subject matter, called MIX. As null hypothesis
this work takes the networks obtained from the "Configuration Model" (CM) algorithm, composed
of 1200 vertices. This algorithm allows us to build random networks with a given probability
density of edges per vertex, $P(k)$. We have used $P(k) \sim k^{-\gamma}$ with $\gamma = 2.1$,
since this value describes the behavior for $k \to \infty$ of both the null hypothesis and the
real cases. As can be observed in Fig. 1, in both cases the asymptotic behavior of $K_{nn}$
agrees with that given by the CM. This dissociative mixing is quite different from that found
in jazz communities or in scientific collaboration networks, whose mixing is purely
associative.
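A null model of this kind can be sketched as follows, using 1200 vertices and $\gamma = 2.1$ as quoted above. The details are assumptions rather than the procedure actually used: degrees are drawn between 1 and $n - 1$, half-edges ("stubs") are paired at random, and self-loops and repeated edges are simply discarded, which is one common variant of the Configuration Model.

```python
import random

def power_law_degrees(n, gamma, k_min=1, k_max=None, rng=random):
    """Draw n degrees with P(k) ~ k^(-gamma) on k_min .. k_max."""
    k_max = k_max or n - 1
    ks = range(k_min, k_max + 1)
    weights = [k ** -gamma for k in ks]
    degs = rng.choices(ks, weights=weights, k=n)
    if sum(degs) % 2:                    # the number of stubs must be even
        degs[0] += 1
    return degs

def configuration_model(degrees, rng=random):
    """Pair stubs at random; self-loops and repeated edges are discarded."""
    stubs = [v for v, d in enumerate(degrees) for _ in range(d)]
    rng.shuffle(stubs)
    edges = set()
    for v, w in zip(stubs[::2], stubs[1::2]):
        if v != w:
            edges.add((min(v, w), max(v, w)))
    return edges

rng = random.Random(0)                   # fixed seed for repeatability
degs = power_law_degrees(1200, 2.1, rng=rng)
edges = configuration_model(degs, rng)
print(len(edges), "edges among", len(degs), "vertices")
```

Because discarded pairings lower some degrees, the realized degree of a vertex can be smaller than the prescribed one; for a heavy-tailed $P(k)$ this mainly affects the hubs.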

The following table compares the values of the "clique" C, calculated with the Watts and
Strogatz algorithm, and of the averaged geodesic G for each case.





        CM       LU             MIX
C       0.50     0.7 ± 0.3      0.9 ± 0.2
G       3.63     3.36 ± 0.02    2.83 ± 0.02



In order to calculate these parameters for the real cases, non-connected graphs had to be
taken into account; that is, the closure of the adjacency matrix obtained from the Warshall
algorithm is not dense. This can be observed in Fig. 2. It leads to different values of the
averaged geodesic, because some actors are not linked. We therefore adopted an ad hoc
criterion: the parameters were calculated on the maximal dense subgraphs, which contain the
most popular actors, in agreement with in situ observations. That is, the behavior of a
mailing list is determined by the most connected actors and not by isolated or occasional
answerers. As a result the number of vertices is reduced by 60%, which is of no consequence
given the huge number of actors.
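The criterion described above can be illustrated with a short Python sketch. It assumes that the "maximal dense subgraph" can be read as the largest connected component found in the Warshall closure, and it computes the averaged geodesic by breadth-first search; the function names and the toy matrix are hypothetical.

```python
from collections import deque

def warshall_closure(a):
    """Transitive closure of a binary matrix by Warshall's algorithm."""
    n = len(a)
    c = [row[:] for row in a]
    for k in range(n):
        for i in range(n):
            if c[i][k]:
                for j in range(n):
                    if c[k][j]:
                        c[i][j] = 1
    return c

def largest_component(a):
    """Vertices of the largest connected subgraph, read off the closure."""
    c = warshall_closure(a)
    groups = {frozenset([i] + [j for j, x in enumerate(c[i]) if x])
              for i in range(len(a))}
    return sorted(max(groups, key=len))

def mean_geodesic(a, nodes):
    """Average shortest-path length over ordered pairs inside `nodes`."""
    inside = set(nodes)
    total = pairs = 0
    for s in nodes:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, x in enumerate(a[u]):
                if x and v in inside and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

# Toy network: a chain 0-1-2 and a separate pair 3-4.
A = [[0, 1, 0, 0, 0],
     [1, 0, 1, 0, 0],
     [0, 1, 0, 0, 0],
     [0, 0, 0, 0, 1],
     [0, 0, 0, 1, 0]]
core = largest_component(A)
print(core, mean_geodesic(A, core))
```

The closure of this toy matrix contains zeros (for example between actors 0 and 3), which is the signature of a non-connected graph; restricting the average to the largest component keeps the geodesic finite.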

The cumulative probability $F(k)$ is the most significant evidence of the difference between
the null hypothesis and the real cases. As can be seen in Fig. 3, the tail ($k \to \infty$) is
quite different from the theoretical straight line of the CM, which is characteristic of
"scale-free" networks. The behavior is "single-scale".

Instead of performing a linear fit, we use a quadratic fit,
$\log F(k) = a - \gamma (\log k)^2$. Fig. 3 shows the goodness of the fit both in the tail
and in the intermediate range. This allows us to discard the behavior $P(k) \sim k^{-\gamma}$
for the tail and replace it with $P(k) \sim k^{-\gamma \log k}$, where $\gamma = 0.17$ for LU
and $\gamma = 0.24$ for MIX.

FIG. 1: Plot of $K_{nn}(k)$. CM comes from the "Configuration Model" algorithm, LU corresponds
to the purely technical list and MIX to the list of general interest. The data were
intentionally not averaged, in order to give a better picture of their scatter.

FIG. 2: Plot of the transitive closure of the adjacency matrix. The dots represent the binary
value "1" and the white areas represent "0"; it can be seen that the complete graph is not
dense.

FIG. 3: Log plot of $F(k)$, where in every case $F(0) = 1$. LU and MIX show an abrupt jump due
to the abundance of unpopular actors.

FIG. 4: Logarithmic plot showing the fit to the data.
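The quadratic fit of $\log F(k)$ is linear in the variable $(\log k)^2$, so it can be sketched as an ordinary least-squares line. The example below assumes natural logarithms (the base is not stated in the text) and is checked on synthetic, noise-free values generated from the model itself, not on the data of the lists; the function name is hypothetical.

```python
import math

def fit_log_squared(ks, fs):
    """Least-squares fit of log F(k) = a - gamma * (log k)^2."""
    xs = [math.log(k) ** 2 for k in ks]
    ys = [math.log(f) for f in fs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, -slope       # (a, gamma)

# Synthetic check: points generated with a = 0.5 and gamma = 0.17.
ks = list(range(1, 51))
fs = [math.exp(0.5 - 0.17 * math.log(k) ** 2) for k in ks]
a, gamma = fit_log_squared(ks, fs)
print(round(a, 6), round(gamma, 6))
```

With real data the points with $F(k) = 0$ have to be left out before taking logarithms, and the value of $\gamma$ depends on the base of the logarithm chosen.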

It is worth noting that the same actors are linked in different ways when the themes are
different: here the tail for the list LU is not the same as for the list MIX. Therefore the
language itself is not the only constraint on the behavior of the network; the social
paradigm in which the actors are involved is another. This also indicates that the cumulative
probability may be regarded as a qualifying element of the social behavior in this kind of
society.

III. CONCLUSIONS 

In this work we conclude that the social relations among an identical set of actors are
strongly linked to the social paradigm in which they are involved. Moreover, at least in the
societies under analysis, the behavior of the tail of $F(k)$ allows their differences to be
quantified. We may speculate that the dissociative character of these societies is due to the
fact that they are open societies rather than closed ones, such as "small-world" networks.
That is, not all the actors are related to one another by answering mails, and some actors
cause the extinction of a theme by avoiding any close link.